{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Linear models for classification\n",
    "\n",
    "In regression, we saw that the target to be predicted is a continuous\n",
    "variable. In classification, the target is discrete (e.g. categorical).\n",
    "\n",
    "In this notebook we go back to the penguin dataset. However, this time the\n",
    "task is to predict the penguin species using the culmen information. We also\n",
    "simplify our classification problem by selecting only 2 of the penguin species\n",
    "to solve a binary classification problem."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
    "\n",
    "# only keep the Adelie and Chinstrap classes\n",
    "penguins = (\n",
    "    penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
    ")\n",
    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_column = \"Species\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can quickly start by visualizing the feature distribution by class:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "for feature_name in culmen_columns:\n",
    "    plt.figure()\n",
    "    # plot the histogram for each specie\n",
    "    penguins.groupby(\"Species\")[feature_name].plot.hist(alpha=0.5, legend=True)\n",
    "    plt.xlabel(feature_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe that we have quite a simple problem. When the culmen length\n",
    "increases, the probability that the penguin is a Chinstrap is closer to 1.\n",
    "However, the culmen depth is not helpful for predicting the penguin species.\n",
    "\n",
    "For model fitting, we separate the target from the data and we create a\n",
    "training and a testing set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n",
    "\n",
    "data_train = penguins_train[culmen_columns]\n",
    "data_test = penguins_test[culmen_columns]\n",
    "\n",
    "target_train = penguins_train[target_column]\n",
    "target_test = penguins_test[target_column]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The linear regression that we previously saw predicts a continuous output.\n",
    "When the target is a binary outcome, one can use the logistic function to\n",
    "model the probability. This model is known as logistic regression.\n",
    "\n",
    "Scikit-learn provides the class `LogisticRegression` which implements this\n",
    "algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\n",
    "logistic_regression.fit(data_train, target_train)\n",
    "accuracy = logistic_regression.score(data_test, target_test)\n",
    "print(f\"Accuracy on test set: {accuracy:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since we are dealing with a classification problem containing only 2 features,\n",
    "it is then possible to observe the decision function boundary. The boundary is\n",
    "the rule used by our predictive model to affect a class label given the\n",
    "feature values of the sample.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p>Here, we use the class <tt class=\"docutils literal\">DecisionBoundaryDisplay</tt>. This educational tool allows\n",
    "us to gain some insights by plotting the decision function boundary learned by\n",
    "the classifier in a 2 dimensional feature space.</p>\n",
    "<p class=\"last\">Notice however that in more realistic machine learning contexts, one would\n",
    "typically fit on more than two features at once and therefore it would not be\n",
    "possible to display such a visualization of the decision boundary in\n",
    "general.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "DecisionBoundaryDisplay.from_estimator(\n",
    "    logistic_regression,\n",
    "    data_test,\n",
    "    response_method=\"predict\",\n",
    "    cmap=\"RdBu_r\",\n",
    "    alpha=0.5,\n",
    ")\n",
    "sns.scatterplot(\n",
    "    data=penguins_test,\n",
    "    x=culmen_columns[0],\n",
    "    y=culmen_columns[1],\n",
    "    hue=target_column,\n",
    "    palette=[\"tab:red\", \"tab:blue\"],\n",
    ")\n",
    "_ = plt.title(\"Decision boundary of the trained\\n LogisticRegression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, we see that our decision function is represented by a straight line\n",
    "separating the 2 classes.\n",
    "\n",
    "For the mathematically inclined reader, the equation of the decision boundary\n",
    "is:\n",
    "\n",
    "    coef0 * x0 + coef1 * x1 + intercept = 0\n",
    "\n",
    "where `x0` is `\"Culmen Length (mm)\"` and `x1` is `\"Culmen Depth (mm)\"`.\n",
    "\n",
    "This equation is equivalent to (assuming that `coef1` is non-zero):\n",
    "\n",
    "    x1 = coef0 / coef1 * x0 - intercept / coef1\n",
    "\n",
    "which is the equation of a straight line.\n",
    "\n",
    "Since the line is oblique, it means that both coefficients (also called\n",
    "weights) are non-null:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coefs = logistic_regression[-1].coef_[0]\n",
    "weights = pd.Series(coefs, index=[f\"Weight for '{c}'\" for c in culmen_columns])\n",
    "weights"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can [access pipeline\n",
    "steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)\n",
    "by name or position. In the code above `logistic_regression[-1]` means the\n",
    "last step of the pipeline. Then you can access the attributes of that step such\n",
    "as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,\n",
    "`n_features`) and then we access it via its first entry. Alternatively one\n",
    "could use `coef_.ravel()`.\n",
    "\n",
    "We are now ready to visualize the weight values as a barplot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "weights.plot.barh()\n",
    "_ = plt.title(\"Weights of the logistic regression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If one of the weights had been zero, the decision boundary would have been\n",
    "either horizontal or vertical.\n",
    "\n",
    "Furthermore the intercept is also non-zero, which means that the decision does\n",
    "not go through the point with (0, 0) coordinates.\n",
    "\n",
    "## (Estimated) predicted probabilities\n",
    "\n",
    "The `predict` method in classification models returns what we call a \"hard\n",
    "class prediction\", i.e. the most likely class a given data point would belong\n",
    "to. We can confirm the intuition given by the `DecisionBoundaryDisplay` by\n",
    "testing on a hypothetical `sample`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_penguin = pd.DataFrame(\n",
    "    {\"Culmen Length (mm)\": [45], \"Culmen Depth (mm)\": [17]}\n",
    ")\n",
    "logistic_regression.predict(test_penguin)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, our logistic regression classifier predicts the Chinstrap\n",
    "species. Note that this agrees with the decision boundary plot above: the\n",
    "coordinates of this test data point match a location close to the decision\n",
    "boundary, in the red region.\n",
    "\n",
    "As mentioned in the introductory slides \ud83c\udfa5 **Intuitions on linear models**,\n",
    "one can alternatively use the `predict_proba` method to compute continuous\n",
    "values (\"soft predictions\") that correspond to an estimation of the confidence\n",
    "of the target belonging to each class.\n",
    "\n",
    "For a binary classification scenario, the logistic regression makes both hard\n",
    "and soft predictions based on the [logistic\n",
    "function](https://en.wikipedia.org/wiki/Logistic_function) (also called\n",
    "sigmoid function), which is S-shaped and maps any input into a value between 0\n",
    "and 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_proba = logistic_regression.predict_proba(test_penguin)\n",
    "y_pred_proba"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "More in general, the output of `predict_proba` is an array of shape\n",
    "(`n_samples`, `n_classes`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_proba.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also notice that the sum of (estimated) predicted probabilities across classes\n",
    "is 1.0 for each given sample. We can visualize them for our `test_penguin` as\n",
    "follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_proba_sample = pd.Series(\n",
    "    y_pred_proba.ravel(), index=logistic_regression.classes_\n",
    ")\n",
    "y_proba_sample.plot.bar()\n",
    "plt.ylabel(\"Estimated probability\")\n",
    "_ = plt.title(\"Probability of the sample belonging to a penguin class\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition warning alert alert-danger\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Warning</p>\n",
    "<p class=\"last\">We insist that the output of <tt class=\"docutils literal\">predict_proba</tt> are just estimations. Their\n",
    "reliability on being a good estimate of the true conditional class-assignment\n",
    "probabilities depends on the quality of the model. Even classifiers with a\n",
    "high accuracy on a test set may be overconfident for some individuals and\n",
    "underconfident for others.</p>\n",
    "</div>\n",
    "\n",
    "Similarly to the hard decision boundary shown above, one can set the\n",
    "`response_method` to `\"predict_proba\"` in the `DecisionBoundaryDisplay` to\n",
    "rather show the confidence on individual classifications. In such case the\n",
    "boundaries encode the estimated probablities by color. In particular, when\n",
    "using [matplotlib diverging\n",
    "colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html#diverging)\n",
    "such as `\"RdBu_r\"`, the softer the color, the more unsure about which class to\n",
    "choose (the probability of 0.5 is mapped to white color).\n",
    "\n",
    "Equivalently, towards the tails of the curve the sigmoid function approaches\n",
    "its asymptotic values of 0 or 1, which are mapped to darker colors. Indeed,\n",
    "the closer the predicted probability is to 0 or 1, the more confident the\n",
    "classifier is in its predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DecisionBoundaryDisplay.from_estimator(\n",
    "    logistic_regression,\n",
    "    data_test,\n",
    "    response_method=\"predict_proba\",\n",
    "    cmap=\"RdBu_r\",\n",
    "    alpha=0.5,\n",
    ")\n",
    "sns.scatterplot(\n",
    "    data=penguins_test,\n",
    "    x=culmen_columns[0],\n",
    "    y=culmen_columns[1],\n",
    "    hue=target_column,\n",
    "    palette=[\"tab:red\", \"tab:blue\"],\n",
    ")\n",
    "_ = plt.title(\"Predicted probability of the trained\\n LogisticRegression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For multi-class classification the logistic regression uses the [softmax\n",
    "function](https://en.wikipedia.org/wiki/Softmax_function) to make predictions.\n",
    "Giving more details on that scenario is beyond the scope of this MOOC.\n",
    "\n",
    "In any case, interested users are referred to the [scikit-learn user guide](\n",
    "https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)\n",
    "for a more mathematical description of the `predict_proba` method of the\n",
    "`LogisticRegression` and the respective normalization functions."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}